Apply spatial operations to answer policy-relevant research questions
Integrate census demographic data with spatial analysis
Create publication-quality visualizations and maps
Work with spatial data from multiple sources
Communicate findings effectively for policy audiences
Part 1: Healthcare Access for Vulnerable Populations
Research Question
Which Pennsylvania counties have the highest proportion of vulnerable populations (elderly + low-income) living far from hospitals?
Your analysis should identify counties that should be priorities for healthcare investment and policy intervention.
Required Analysis Steps
Complete the following analysis, documenting each step with code and brief explanations:
Step 1: Data Collection (5 points)
Load the required spatial data:
Pennsylvania county boundaries
Pennsylvania hospitals (from lecture data)
Pennsylvania census tracts
Your Task:
# Load required packageslibrary(sf)library(tidyverse)library(tigris)library(tidycensus)library(scales)library(patchwork)library(here)library(knitr)# Add this near the top of your .qmd after loading librariesoptions(tigris_use_cache =TRUE)options(tigris_progress =FALSE) # Suppress tigris progress bars# Load spatial datapa_counties <-st_read("data/Pennsylvania_County_Boundaries.shp")
Reading layer `Pennsylvania_County_Boundaries' from data source
`/Users/jinhengcen/Documents/portfolio-setup-CenJinHeng/assignments/assignment_2/data/Pennsylvania_County_Boundaries.shp'
using driver `ESRI Shapefile'
Simple feature collection with 67 features and 19 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -8963377 ymin: 4825316 xmax: -8314404 ymax: 5201413
Projected CRS: WGS 84 / Pseudo-Mercator
hospitals <-st_read("data/hospitals.geojson")
Reading layer `hospitals' from data source
`/Users/jinhengcen/Documents/portfolio-setup-CenJinHeng/assignments/assignment_2/data/hospitals.geojson'
using driver `GeoJSON'
Simple feature collection with 223 features and 11 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: -80.49621 ymin: 39.75163 xmax: -74.86704 ymax: 42.13403
Geodetic CRS: WGS 84
census_tracts <-tracts(state ="PA", cb =TRUE)hospitals <-st_transform(hospitals, st_crs(pa_counties))census_tracts <-st_transform(census_tracts, st_crs(pa_counties))# Check that all data loaded correctly
Questions to answer:
How many hospitals are in your dataset? 223
How many census tracts? 3445
What coordinate reference system is each dataset in? I converted them all in WGS 84 / Pseudo-Mercator
Step 2: Get Demographic Data
Use tidycensus to download tract-level demographic data for Pennsylvania.
Required variables:
Total population
Median household income
Population 65 years and over (you may need to sum multiple age categories)
# Join to tract boundariescensus_tracts_joined <- census_tracts %>%left_join(tract_data, by ="GEOID")
Questions to answer:
What year of ACS data are you using? 2023
How many tracts have missing income data? 66
What is the median income across all PA census tracts? 72943.5
Step 3: Define Vulnerable Populations
Identify census tracts with vulnerable populations based on TWO criteria: 1. Low median household income (choose an appropriate threshold) 2. Significant elderly population (choose an appropriate threshold)
Your Task:
# Filter for vulnerable tracts based on your criteriacensus_tracts_joined <- census_tracts_joined %>%mutate(percentage_65 = pop_65/total_populationE )vulnerable_tracts <- census_tracts_joined %>%filter(median_household_incomeE <48000, percentage_65 >0.20 )
Questions to answer:
What income threshold did you choose and why? For the income threshold, I chose 48,000 based on the 2025 poverty guidelines by the USCIS. For a family with four people, 150% of the federal poverty level is $48,225.
What elderly population threshold did you choose and why? For the age threshold, I chose the percentage of people over 65 years old is over 20%, which can be seen as an elderly community.
How many tracts meet your vulnerability criteria? 126
What percentage of PA census tracts are considered vulnerable by your definition? the percentage of vulnerable tracts = 126 / 3445 = 3.66%
Step 4: Calculate Distance to Hospitals
For each vulnerable tract, calculate the distance to the nearest hospital.
Your Task:
# Transform to appropriate projected CRSvulnerable_tracts_proj <- vulnerable_tracts %>%st_transform(5070) # 5070 is for the whole country. For south PA alone, we can use 3365, which is more precise.hospitals_proj <- hospitals %>%st_transform(5070)# Calculate distance from each tract centroid to nearest hospitaltract_centroids <-st_centroid(vulnerable_tracts_proj)nearest_hospital_idx <-st_nearest_feature(tract_centroids, hospitals_proj)dist_to_hospital <-st_distance( tract_centroids, hospitals_proj[nearest_hospital_idx, ],by_element =TRUE)vulnerable_tracts_proj <- vulnerable_tracts_proj %>%mutate(dist_to_hospital =as.numeric(dist_to_hospital),dist_to_hospital_mi = dist_to_hospital *0.000621371 )distant_data <- vulnerable_tracts_proj %>%summarise(avg_distance =mean(dist_to_hospital_mi, na.rm =TRUE),max_dist =max(dist_to_hospital_mi, na.rm =TRUE), )num_over15 <- vulnerable_tracts_proj %>%filter(dist_to_hospital_mi >15) %>%summarise(n_over15 =n(), )
Requirements:
Use an appropriate projected coordinate system for Pennsylvania
Calculate distances in miles
Explain why you chose your projection
Questions to answer:
What is the average distance to the nearest hospital for vulnerable tracts? 2.8 miles
What is the maximum distance? 18.6 miles
How many vulnerable tracts are more than 15 miles from the nearest hospital? 1
Step 5: Identify Underserved Areas
Define “underserved” as vulnerable tracts that are more than 15 miles from the nearest hospital.
What percentage of vulnerable tracts are underserved? 1/126 = 0.79%
Does this surprise you? Why or why not? No, I looked at the distance data of the vulnerable tracts, most of them (92.9%) are under 10 miles (16000m). I think 15 miles is a very big number, so we just have one tract fulfill all the requirements.
Step 6: Aggregate to County Level
Use spatial joins and aggregation to calculate county-level statistics about vulnerable populations and hospital access.
Percentage of vulnerable tracts that are underserved
Average distance to nearest hospital for vulnerable tracts
Total vulnerable population
Questions to answer:
Which 5 counties have the highest percentage of underserved vulnerable tracts? CAMERON is the highest, with 100% in the number. Since I only have 1 underserved tract, I would show top 5 counties with the highest percentage of vulnerable tracts:
CAMERON 50.0%
FAYETTE 19.4%
CAMBRIA 19.0%
JEFFERSON 15.4%
WARREN 15.4%
Which counties have the most vulnerable people living far from hospitals? PHILADELPHIA, with the number of 18022
Are there any patterns in where underserved counties are located? Since I only have 1 underserved tract, I would observe patterns of vulnerable counties.
Most of them are from rural areas with less tracts number, so they have higher percentage of vulnerable tracts
They have large numbers of vulnerable population, which means lots of low-income and old people living there. Most of them have number over 1000.
Step 7: Create Summary Table
Create a professional table showing the top 10 priority counties for healthcare investment.
#only one very small blue tract in the center of the map
Requirements:
Show underserved vulnerable tracts in a contrasting color
Include county boundaries for context
Show hospital locations
Use appropriate visual hierarchy (what should stand out?)
Include informative title and subtitle
Chart: Distribution Analysis
Create a visualization showing the distribution of distances to hospitals for vulnerable populations.
Your Task:
# Create distribution visualization# Histogram or density plot of distancesggplot(vulnerable_tracts_proj) +aes(x = dist_to_hospital_mi) +geom_histogram(bins =15, fill ="steelblue", alpha =0.7) +labs(title ="Distribution of distance to the nearest hospital",x ="distance to the nearest hospital",y ="Number of vulnerable tracts",caption ="Most of tracts have distance between 0 and 5." ) +theme_minimal()
# Box plot comparing distances across regionsggplot(vulnerable_tracts_proj) +aes(x = NAMELSADCO, y = dist_to_hospital_mi, fill = NAMELSADCO) +geom_boxplot() +labs(title ="Distances by County Category",x ="County",y ="Distance",caption ="While most counties show little variation, four counties have huge internal differences" ) +theme_minimal() +theme(legend.position ="none")+theme(axis.text.x =element_text(angle =90, vjust =0.5, hjust =1, size =7))
# Bar chart of underserved tracts by county#only 1 underserved tracts in my study, it is Census Tract 9601 in the Cameron County# Scatter plot of distance vs. vulnerable population sizeggplot(vulnerable_tracts_proj) +aes(x = dist_to_hospital_mi, y = total_populationE) +geom_point(color ="steelblue", alpha =0.6, size =2) +geom_smooth(method ="lm", se =TRUE, color ="darkred", fill ="pink", linewidth =1 ) +labs(title ="distance vs. vulnerable population size",x ="Distance to the nearest hospital",y ="Vulnerable population size",caption =str_wrap("More vulnerable population, little distance to hospital. Since hospitals are designed to located near population clusters, and we didn't limit this effect in this study ", width =90) ) +theme_minimal() +scale_x_continuous() +scale_y_continuous(labels = comma)
Suggested chart types:
Histogram or density plot of distances
Box plot comparing distances across regions
Bar chart of underserved tracts by county
Scatter plot of distance vs. vulnerable population size
Requirements:
Clear axes labels with units
Appropriate title
Professional formatting
Brief interpretation (1-2 sentences as a caption or in text)
Part 3: Bring Your Own Data Analysis
Choose your own additional spatial dataset and conduct a supplementary analysis.
Challenge Options
Choose ONE of the following challenge exercises, or propose your own research question using OpenDataPhilly data (https://opendataphilly.org/datasets/).
Note these are just loose suggestions to spark ideas - follow or make your own as the data permits and as your ideas evolve. This analysis should include bringing in your own dataset, ensuring the projection/CRS of your layers align and are appropriate for the analysis (not lat/long or geodetic coordinate systems). The analysis portion should include some combination of spatial and attribute operations to answer a relatively straightforward question
Education & Youth Services
Option A: Educational Desert Analysis
Data: Schools, Libraries, Recreation Centers, Census tracts (child population)
Question: “Which neighborhoods lack adequate educational infrastructure for children?”
Operations: Buffer schools/libraries (0.5 mile walking distance), identify coverage gaps, overlay with child population density
Policy relevance: School district planning, library placement, after-school program siting
Option B: School Safety Zones - Data: Schools, Crime Incidents, Bike Network - Question: “Are school zones safe for walking/biking, or are they crime hotspots?” - Operations: Buffer schools (1000ft safety zone), spatial join with crime incidents, assess bike infrastructure coverage - Policy relevance: Safe Routes to School programs, crossing guard placement
Environmental Justice
Option C: Green Space Equity
Data: Parks, Street Trees, Census tracts (race/income demographics)
Question: “Do low-income and minority neighborhoods have equitable access to green space?”
Operations: Buffer parks (10-minute walk = 0.5 mile), calculate tree canopy or park acreage per capita, compare by demographics
Most datasets available as GeoJSON, Shapefile, or CSV with coordinates
Always check the Metadata for a data dictionary of the fields.
Additional Sources:
Pennsylvania Open Data: https://data.pa.gov/
Census Bureau (via tidycensus): Demographics, economic indicators, commute patterns
TIGER/Line (via tigris): Geographic boundaries
Recommended Starting Points
If you’re feeling confident: Choose an advanced challenge with multiple data layers. If you are a beginner, choose something more manageable that helps you understand the basics
If you have a different idea: Propose your own question! Just make sure:
You can access the spatial data
You can perform at least 2 spatial operations
Your Analysis
Your Task:
Find and load additional data
Document your data source
Check and standardize the CRS
Provide basic summary statistics
# Load your additional datasetlibrary(sf)schools <-st_read("Data/Schools/Schools.shp")
Reading layer `Schools' from data source
`/Users/jinhengcen/Documents/portfolio-setup-CenJinHeng/assignments/assignment_2/data/Schools/Schools.shp'
using driver `ESRI Shapefile'
Simple feature collection with 495 features and 14 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: -8378628 ymin: 4852555 xmax: -8345686 ymax: 4884813
Projected CRS: WGS 84 / Pseudo-Mercator
crime <-st_read("Data/Crime/Crime.shp")
Reading layer `Crime' from data source
`/Users/jinhengcen/Documents/portfolio-setup-CenJinHeng/assignments/assignment_2/data/Crime/Crime.shp'
using driver `ESRI Shapefile'
replacing null geometries with empty geometries
Simple feature collection with 160388 features and 13 fields (with 6744 geometries empty)
Geometry type: POINT
Dimension: XY
Bounding box: xmin: -80.50237 ymin: 39.87688 xmax: -74.95784 ymax: 42.22434
Geodetic CRS: WGS 84
Reading layer `Bike_Network' from data source
`/Users/jinhengcen/Documents/portfolio-setup-CenJinHeng/assignments/assignment_2/data/Bike_Network/Bike_Network.shp'
using driver `ESRI Shapefile'
Simple feature collection with 5225 features and 8 fields
Geometry type: LINESTRING
Dimension: XY
Bounding box: xmin: -8378937 ymin: 4847835 xmax: -8345146 ymax: 4883978
Projected CRS: WGS 84 / Pseudo-Mercator
schools <-st_transform(schools, 2272)crime <-st_transform(crime, 2272)bike_network <-st_transform(bike_network, 2272)library(dplyr)library(ggplot2)## school typeschools %>%st_drop_geometry() %>%count(type_speci, sort =TRUE) %>%ggplot(aes(x =reorder(type_speci, n), y = n)) +geom_bar(stat ="identity", fill ="lightblue") +coord_flip() +labs(title ="Distribution of school types",x ="School type",y ="Number" ) +theme_minimal(base_size =13)
# Crime Hourcrime %>%st_drop_geometry() %>%count(hour) %>%ggplot(aes(x = hour, y = n)) +geom_col(fill ="firebrick") +labs(title ="Distribution of crime hour",x ="Hour",y ="Number of incidents" ) +theme_minimal(base_size =13)
#crime typescrime %>%st_drop_geometry() %>%count(text_gener, sort =TRUE) %>%slice_max(n, n =10) %>%ggplot(aes(x =reorder(text_gener, n), y = n)) +geom_bar(stat ="identity", fill ="orange") +coord_flip() +labs(title ="Top 10 Crime Types",x ="Types",y ="Number of incidents" ) +theme_minimal(base_size =13)
# Bike lanebike_length_summary <- bike_network %>%st_drop_geometry() %>%group_by(TYPE) %>%summarise(total_length =sum(Shape__Len, na.rm =TRUE)) %>%arrange(desc(total_length))ggplot(bike_length_summary, aes(x =reorder(TYPE, total_length), y = total_length)) +geom_bar(stat ="identity", fill ="forestgreen") +coord_flip() +labs(title ="Distribution of Bike Lane Type",x ="Type",y ="Total Length" ) +theme_minimal(base_size =13)
Questions to answer:
What dataset did you choose and why? Three data sets are chosen:
Schools data in shp format
Crime incidents data points in shp format
Bike_Network in shp format Since I am going to calculate some metrics like the numbers of crime incidents and the length of bike lanes within school buffers, these data are needed.
What is the data source and date? They are all from OpenDataPhilly. All of them are up_to_date. The crime incidents data is from 2024, because we are in the October of the 2025, so we don’t have data for the whole 2025. Using 2024 can avoid some time variation.
How many features does it contain? All of them have geometry, which means we can run them in spatial analysis. For the school data set, we got names, types, grade levels, phone numbers, street names, and zip codes. For the crime data, we have precise time, crime types, location details. For the bike network data, we knew street names, types, lengths and oneway.
What CRS is it in? Did you need to transform it? EPSG:2272, which is used in south Pennsylvania.
Pose a research question
Write a clear research statement that your analysis will answer.
Examples:
“Do vulnerable tracts have adequate public transit access to hospitals?”
“Are EMS stations appropriately located near vulnerable populations?”
“Do areas with low vehicle access have worse hospital access?”
My question is “Are school zones safe for walking/biking, or are they crime hotspots?”
Conduct spatial analysis
Use at least TWO spatial operations to answer your research question.
Required operations (choose 2+):
Buffers
Spatial joins
Spatial filtering with predicates
Distance calculations
Intersections or unions
Point-in-polygon aggregation
Your Task:
# Your spatial analysis# I have loaded data sets and make simple statistical analysis for each of them in the previous chunk# Create buffer for schoolsschools <- schools %>%mutate(school_id =row_number())zones <-st_buffer(schools, dist =1000)# Calculate the crime numbercrime_by_zone <- crime %>%st_join(zones %>%select(school_id), join = st_within) %>%st_drop_geometry() %>%count(school_id, name ="crime_count")# Calculate the bike lane lengthbike_in_zone <-st_intersection(bike_network, zones %>%select(school_id))bike_len_by_zone <- bike_in_zone %>%mutate(len_ft =as.numeric(st_length(.))) %>%st_drop_geometry() %>%group_by(school_id) %>%summarise(bike_len_ft =sum(len_ft, na.rm =TRUE), .groups ="drop")# Combine the resultzone_stats <- zones %>%left_join(crime_by_zone, by ="school_id") %>%left_join(bike_len_by_zone, by ="school_id")# Calculate average number of the crime incidents and bike lane lengthzone_summary <- zone_stats %>%st_drop_geometry() %>%summarise(mean_crime =mean(crime_count, na.rm =TRUE),mean_bike_len =mean(bike_len_ft, na.rm =TRUE) )city_summary <-tibble(mean_crime =nrow(crime) /nrow(zones),mean_bike_len =sum(st_length(bike_network)) /nrow(zones))# Combine resultscomparison <-rbind(data.frame(Category ="School Zones", mean_crime = zone_summary$mean_crime, mean_bike_len = zone_summary$mean_bike_len),data.frame(Category ="Citywide Average",mean_crime = city_summary$mean_crime,mean_bike_len = city_summary$mean_bike_len))# Draw diagramggplot(comparison, aes(x = Category, y = mean_crime, fill = Category)) +geom_bar(stat ="identity", width =0.6) +labs(title ="Comperison of Average Crime Incidents: School Zones VS City",y ="Averge Crime Incidents" ) +theme_minimal(base_size =14) +theme(legend.position ="none")
ggplot(comparison, aes(x = Category, y = mean_bike_len, fill = Category)) +geom_bar(stat ="identity", width =0.6) +labs(title ="Comperison of Bike Lane Length: School Zones VS City",y ="Average Bike Lane Length" ) +theme_minimal(base_size =14) +theme(legend.position ="none")
# Mapping# Get tracts of Phillytracts_phl <-get_acs(geography ="tract",state ="PA",county ="Philadelphia",variables ="B01003_001",survey ="acs5",year =2022,geometry =TRUE,cache_table =TRUE) %>%select(GEOID, NAME, pop = estimate, geometry) %>%st_transform(2272) # point_on_surfacezone_points <- zone_stats |>mutate(geometry =st_point_on_surface(geometry),bike_len_mile = bike_len_ft /5280) |>st_as_sf()# Crime point mapggplot() +geom_sf(data = tracts_phl, fill =NA, color ="grey", linewidth =0.2) +geom_sf(data = zone_points, aes(size = crime_count), color ="firebrick", alpha =0.6) +scale_size_continuous(name ="Crime count", range =c(0, 4)) +labs(title ="School Zones: Crime Counts (Point size = count)", x =NULL, y =NULL) +theme_minimal(base_size =13) +theme(legend.position ="right")
# Bike lane point mapggplot() +geom_sf(data = tracts_phl, fill =NA, color ="grey", linewidth =0.2) +geom_sf(data = zone_points, aes(size = bike_len_mile), color ="forestgreen", alpha =0.6) +scale_size_continuous(name ="Bike lane (miles)", range =c(0.1, 2)) +labs(title ="School Zones: Bike Lane Length (Point size = miles)", x =NULL, y =NULL) +theme_minimal(base_size =13) +theme(legend.position ="right")
Analysis requirements:
Clear code comments explaining each step
Appropriate CRS transformations
Summary statistics or counts
At least one map showing your findings
Brief interpretation of results (3-5 sentences)
Your interpretation:
School zones have fewer crime incidents and bike lane length compared to city average. More pedestrian-friendly projects should be applied to school areas. For the spatial difference, higher concentrations of crimes are observed around schools in central and southern Philadelphia. School zones with longer bike lane coverage are mostly located in the city’s core and western areas. Planning efforts should focus on improving safety in school zones with higher crime risks, especially in central and southern Philadelphia. Expanding connected and protected bike lanes around schools can promote safer and more accessible environments for students.
Finally - A few comments about your incorporation of feedback!
Take a few moments to clean up your markdown document and then write a line or two or three about how you may have incorporated feedback that you recieved after your first assignment.
I need to hide my API Key and printed tables!
Submission Requirements
What to submit:
Rendered HTML document posted to your course portfolio with all code, outputs, maps, and text
Use embed-resources: true in YAML so it’s a single file
All code should run without errors
All maps and charts should display correctly
File naming:LastName_FirstName_Assignment2.html and LastName_FirstName_Assignment2.qmd